Top-down Extraction of Semi-Structured Data

Authors

  • Berthier A. Ribeiro-Neto
  • Alberto H. F. Laender
  • Altigran Soares da Silva
Abstract

In this paper, we propose an innovative approach to extracting semi-structured data from Web sources. The idea is to collect a couple of example objects from the user and to use this information to extract new objects from new pages or texts. We propose a top-down strategy that extracts complex objects by decomposing them into less complex objects until atomic objects have been extracted. Through experimentation, we demonstrate that, with a small number of given examples, our strategy is able to extract most of the objects present in a Web source given as input.
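
The top-down idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ObjectType` class, the `extract` function, and the hand-written regular expressions are all hypothetical, and the sketch assumes the patterns are supplied directly, whereas the paper derives its extraction knowledge from user-provided examples. It only shows the recursion: locate a complex object first, then extract its parts from within that span, stopping at atomic objects.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectType:
    """Hypothetical object description: a name, a regex that locates
    the object in text, and the sub-objects it is composed of."""
    name: str
    pattern: str
    parts: List["ObjectType"] = field(default_factory=list)


def extract(obj_type, text):
    """Top-down extraction: find each occurrence of obj_type in text,
    then recurse into its parts using only the matched span."""
    results = []
    for match in re.finditer(obj_type.pattern, text):
        span = match.group(0)
        if not obj_type.parts:
            # Atomic object: the matched text is the extracted value.
            results.append(span)
        else:
            # Complex object: decompose it into less complex objects.
            results.append(
                {part.name: extract(part, span) for part in obj_type.parts}
            )
    return results


# Illustrative input and schema (assumed for this example only).
page = "Title: Dune; Price: 9.99\nTitle: Emma; Price: 4.50\n"
book = ObjectType(
    "book",
    r"Title: .*",  # one book per line
    [
        ObjectType("title", r"(?<=Title: )[^;]+"),
        ObjectType("price", r"\d+\.\d+"),
    ],
)

books = extract(book, page)
print(books)
```

Because each part is searched only inside the span of its parent object, a pattern for an atomic object can stay simple without matching text that belongs to a different object.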

Similar resources

Bottom Up and Top Down - Twig Pattern Matching on Indexed Trees

This article describes how to implement efficient memory-resident path indexes for semi-structured data. Two techniques are introduced, and they are shown to be significantly faster than previous methods when facing path queries using the descendant axis and wildcards. The first is conceptually simple and combines inverted lists, selectivity estimation, hit expansion and brute-force search. Th...

Evaluation of Top-k Queries over Structured and Semi-structured Data

Hyperset approach to semi-structured databases and the experimental implementation of the query language Delta

This thesis presents practical suggestions towards the implementation of the hyperset approach to semi-structured databases and the associated query language ∆ (Delta). This work can be characterised as part of a top-down approach to semi-structured databases, from theory to practice. Over the last decade the rise of the World-Wide Web has led to the suggestion for a shift from structured rela...

Learning dialogue structures from a corpus

This paper demonstrates some aspects of a plan processor which is a subcomponent of the dialogue module of Verbmobil. We describe how we transfer results from the research area of grammar extraction for the semi-automatic acquisition of plan operators for turn classes. We exploit statistical knowledge acquired while learning the grammar and incorporate top-down predictions to enhance the corr...

Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach

Structured community portals extract and integrate information from raw Web pages to present a unified view of entities and relationships in the community. In this paper we argue that to build such portals, a top-down, compositional, and incremental approach is a good way to proceed. Compared to current approaches that employ complex monolithic techniques, this approach is easier to develop, un...

Publication date: 1999